The Annals of Applied Statistics 
2011, Vol. 5, No. 2A, 1081-1101 
DOI: 10.1214/10-AOAS426 

© Institute of Mathematical Statistics. 2011 



SURVIVAL ENSEMBLES BY THE SUM OF PAIRWISE 
DIFFERENCES WITH APPLICATION TO LUNG 
CANCER MICROARRAY STUDIES 

By Brent A. Johnson 1 and Qi Long 

Emory University 

Lung cancer is among the most common cancers in the United 
States, in terms of incidence and mortality. In 2009, it is estimated 
that more than 150,000 deaths will result from lung cancer alone. 
Genetic information is an extremely valuable data source in charac- 
terizing the personal nature of cancer. Over the past several years, 
investigators have conducted numerous association studies where in- 
tensive genetic data is collected on relatively few patients compared 
to the numbers of gene predictors, with one scientific goal being to 
identify genetic features associated with cancer recurrence or survival. 
In this note, we propose high-dimensional survival analysis through 
a new application of boosting, a powerful tool in machine learning. 
Our approach is based on an accelerated lifetime model and minimiz- 
ing the sum of pairwise differences in residuals. We apply our method 
to a recent microarray study of lung adenocarcinoma and find that 
our ensemble is composed of 19 genes, while a proportional hazards 
(PH) ensemble is composed of nine genes, a proper subset of the 19- 
gene panel. In one of our simulation scenarios, we demonstrate that 
PH boosting in a misspecified model tends to underfit and ignore 
moderately-sized covariate effects, on average. Diagnostic analyses 
suggest that the PH assumption is not satisfied in the microarray data 
and may explain, in part, the discrepancy in the sets of active coeffi- 
cients. Our simulation studies and comparative data analyses demon- 
strate how statistical learning by PH models alone is insufficient. 

1. Introduction. In 2009, lung (and bronchus) cancer is projected to 
be the third most incident cancer site (behind prostate and breast) in the 
United States. The National Cancer Institute estimates that nearly 220,000 
men and women will be diagnosed with and nearly 160,000 men and women 



Received September 2009; revised September 2010. 

1 Supported in part by a grant from the National Institutes of Allergies and Infectious 
Diseases (R03AI068484) and Emory's Center for AIDS Research. 

Key words and phrases. Accelerated failure time, boosting, lasso, proportional hazards 
regression, survival analysis. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Statistics. 
2011, Vol. 5, No. 2A, 1081-1101. This reprint differs from the original in 
pagination and typographic detail. 



1 



2 



B. A. JOHNSON AND Q. LONG 



will die from lung and bronchus cancer in 2009 (seer.cancer.gov). Large data 
sets containing clinical, microarray, miRNA, and other genomic data for 
virtually all types of cancer are available online and one expects these data 
sets to multiply in the future. While the demand from scientific investigators 
for systematic summary of large data sets is high, there is a short supply 
of easy-to-use, out-of-the-box methods for survival outcomes. The goal of 
this note is to propose a new statistical learner for survival data by boosting 
a weighted concentration measure and applying the method to a challenging 
scientific problem in lung cancer. The routines developed for this paper 
complement the mboost package [Hothorn and Biihlmann (2007)] in R and 
are freely available from the first author's website. 

Boosting is a ubiquitous concept in machine learning and popular among 
statisticians for model fitting, prediction, and variable selection. While early 
applications were driven by problems in classification and discrimination 
[Preund and Schapire (1996, 1997); Breiman (1998)], it is now known that 
boosting applies to a general class of function estimation problems by stage- 
wise descent of a well-defined, convex loss function [cf. Friedman, Hastie and 
Tibshirani (2000); Biihlmann and Hothorn (2007)]. In this paper we propose 
rank-based boosting of survival data in a semi-parametric accelerated life- 
time or failure time (AFT) model [Cox and Oakes (1984); Kalbfleisch and 
Prentice (2002)]. In the linear AFT model, the (natural) logarithm of the 
lifetime Tj is related to a d- vector of predictors X, = (Xn, . . . , Xj^) T , that is, 

d 

(1.1) \ogT i = Y,^X i] +e i (i = l,...,n), 

i=i 

(ei, . . . ,e n ) are random errors from an unknown common distribution, and 
(3 = (j3i, . . . ,Pd) T is an unknown coefficient vector to be estimated. Without 
loss of generality, we assume that the predictors have been standardized to 
have mean zero and unit variance. The observed data are {(U, Aj, Xj), i = 
1, . . . , n}, where U = min(Tj, d), A, = /(Tj < Cj), Cj is a random censoring 
variable for the ith subject, and /(•) denotes the indicator function. Because 
the linear AFT model is based on the linear model, the coefficients and 
their estimates have an interpretation familiar to a broad audience. Indeed, 
Sir David Cox highlighted parameter interpretation when describing the 
appeal of the AFT model compared to parameters in hazards regression 
models [Reid (1994)]. Although boosting in the AFT model has already been 
described by Hothorn et al. (2006) via inverse-probability weighting (IPW), 
the method here is based on an entirely different principle and embodies 
different assumptions (see Section 2). In Section 6 we compare rank-based 
and IPW ensembles to assess how technical assumptions affect performance 
in statistical learners. 
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Ridgeway (1999) first proposed the idea of boosting survival data through 
Cox's (1972) partial log-likelihood in a proportional hazards (PH) model (see 
Section 3). In general, adopting the PH model and complementary partial 
likelihood analysis is the conventional method in ordinary survival regres- 
sion as well as their extensions to model selection and statistical learning. 
However, the proportional hazards assumption may be incorrect. When the 
PH assumption is incorrect but one proceeds with partial likelihood analysis, 
there can be serious side effects on classic statistical inference, resulting in 
incorrect conclusions [Lin, Wei and Ying (1993)]. Although the consequences 
of model misspecification on variable selection may be subtle for any given 
data set, a violation of the PH model assumption would, for example, obvi- 
ate the oracle property [Fan and Li (2002); Johnson, Lin and Zeng (2008)] 
for the relevant penalized partial likelihood estimators. One may use stan- 
dard tools to diagnose the PH assumption in survival data [cf. Schoenfeld 
(1982); Lin, Wei and Ying (1993); Grambsch and Therneau (1994)] and if 
deviations occur, then one can take corrective action through various relax- 
ations of the PH model. A second way to circumvent potential pitfalls in 
PH model misspecification is to posit a different statistical model, such as 
the AFT model. In Section 6 we compare and contrast coefficient ensembles 
across competing survival models using a variety of performance measures. 

This paper was originally motivated by the authors' collaborations with 
investigators who collect and analyze high-dimensional microarray data. Re- 
cently, Morris et al. (2005) analyzed microarray data collected from two 
studies, one conducted at Harvard University and another at the University 
of Michigan. Both studies used Affymetrix oligonucleotide arrays but used 
different versions of Affymetrix chips. Morris et al. pooled the data using 
a "partial probeset" method to match chip types and then used the pooled 
data to identify genes associated with mortality due to lung adenocarcinoma, 
a nonsmall cell form of lung cancer. The goal of our analysis is to develop 
a mortality model of genetic factors for lung adenocarcinoma. Morris et al. 
achieved this goal through one-gene-at-a-time Cox PH models and then con- 
trolled for false discovery. In Section 4 we perform simultaneous estimation 
and variable selection on the same microarray data via boosting. We find 
that PH boosting leads to a nine-gene model and rank-based boosting leads 
to a 19-gene model, with the former active set a proper subset of the latter 
model. Boosting the same data using IPW methods leads one to conclude 
that 94 genes are active, most of which do not appear in either of the other 
two methods. 

In addition to our substantive findings in the lung cancer data, we also 
make the following two methodological contributions. In Section 5 we pro- 
vide an analysis of nursing home data where the sample size far exceeds 
the number of predictors so we can apply standard PH model diagnostics. 



4 



B. A. JOHNSON AND Q. LONG 



The diagnostic tool fails to support the PH model assumption and vari- 
able selection using PH versus AFT models leads to rather different sets 
of active covariates. Later in the same section, we apply blackbox boosting 
using rank-based methods to breast cancer data [Street, Mangasarian and 
Wolberg (1995)] and compare our results to IPW boosting [Hothorn et al. 
(2006); Biihlmann and Hothorn (2007)]. This analysis exemplifies key dif- 
ferences between rank-based and IPW methods even when one adopts the 
same AFT model. To the best of our knowledge, this is the first paper to 
execute nonlinear regression in the AFT model using the rank-based Gehan 
loss and regression trees as base learners. 

2. Methods. 



2.1. Boosting. For many applications, a common goal is to estimate the 
population minimizer, 

(2.1) /„(•) = argmin£[p{[7, A, /(X)}], 

where p is a convex loss function, differentiable with respect to /, and / is 
a function to be estimated in the generalization of model (1.1), that is, 

(2.2) logT i = /(X l )+e J , 

and £j were described in Section 1. We assume that the observed data 
{(Ui, Aj, Xj), % = 1, . . . , n} are a random sample of observations from a com- 
mon distribution function; hence, an ordinary strong law suggests the ex- 
pectation on the right-hand side of (2.1) is well approximated by a sample 
average. Then, the goal of boosting is to minimize the empirical loss func- 
tion, that is, 

1 n 

(2.3) /(•) = argmin - V p{U u A;, /(X,)}. 

n ^— ' 

i=l 

As long as p(-) is convex, we note that the computational exercise in (2.3) is 
well defined even when the population parameter /o(-) in (2.1) is awkward 
or difficult to interpret. 

With only minor notation changes, we briefly outline functional gradient 
descent (FGD) as given in Friedman (2001) and Biihlmann and Hothorn 
(2007): 

(51) Initialize /M = 0. 

(52) Increment m by 1 and compute the negative gradient —(d/df)p{U, 
A, /(X)}. Define Zi as the evaluation of the negative gradient at /[ m_1 l(Xj), 
that is, 



Z l = -—p{U i ,A i ,f} 



/=/V-i](Xi) 



for i = 1, ... ,n. 
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(53) Fit the new pseudo data {{Z{, A,, Xj), i = 1, ...,n} through a base 
procedure to form the ensemble update, g^ m \-). 

(54) Define the ensemble iterate fi m \-) = f^ m ~ l \-) + v ■ (•) , where v is 
a user-defined step-length factor less than or equal to 1 and strictly greater 
than 0. 

(55) Iterate from steps S2 to S4 for a user-defined number of iterations, 
that is, "m stop ." 

The final estimate / is driven by the number of iterations m s t p- Hence, 
"T-stop is a parameter that requires tuning: fewer iterations lead to simple 
models but worse prediction, while increasing m stop increases model com- 
plexity and eventual overfitting. 

2.2. The Gehan loss function. We motivate our loss function through 
coefficient estimation in the semi-parametric AFT model (1.1). In the case 
where d = l, Prentice (1978) proposed linear rank tests for the null hypothe- 
sis that the slope is zero. Tsiatis (1990) inverted the linear rank tests to form 
a class of weighted logrank estimating functions. For a particular choice of in- 
efficient weight function, the weighted logrank estimating function reduces to 

n n 

(2.4) n" 2 £ E A *( X * " Xj^teG 9 ) ^ e 3 (P)}- 

i=l j=i 

We note that (2.4) is the d-dimensional gradient of the following Gehan-type 
(1965) loss function: 

n n 

(2.5) """'EE MeiO) - e 3 {P)}I{em < ej (f3)}, 

1=1 j=l 

where ej(/3) = logf/j — /3 T Xj. If we relax the restriction that /(•) is a linear 
predictor, expression (2.5) is still a proper convex loss function and reduces 
to Jaeckel's (1972) dispersion criterion in uncensored data. In the context of 
boosting, we define the Gehan loss as the following weighted sum of pairwise 
differences: 

n n 

DgU) = " n " 2 EE - < ej) 

i=l j=l 

(2.6) 

n I 

-n~ 1 A i y^(ej - ej)I(ei < ej) 

3=1 

where e, = logf/j — /(Xj). We compute the negative gradient in step (S2) in 
Section 2.1 as the following difference: —(d/df)Dc = — (Ti — I^/ra, where 
Ti = Aj Yll=i I( e i — e j) an< ^ F2 = Y^=i AjH e i ^ e j)- Now, we see that the 
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definition of p{Ui, Aj, /(Xj)} from the prototypical boosting algorithm in 
Section 2.1 is the expression in curly brackets on the right-hand side of (2.6). 

2.3. Parameter tuning. We require a criterion whereby we can assess 
model fit in terms of error and complexity. Because Dc(f) is a convex loss 
function, we may use it to simply perform U-fold cross-validation (CV). 
Unless otherwise specified, we adopt 5-fold CV to tune m s t op for all data 
analyses and simulation studies below. 

As noted by an anonymous referee, boosting is known to be a slow learner 
and hence slow in convergence [cf. Biihlmann and Yu (2003); Blanchard, Lu- 
gosi and Vayatis (2004); Zhang and Yu (2005)]. This suggests that for a fixed 
step-length factor v in step (S4) of the boosting algorithm described in Sec- 
tion 2.1, convergence occurs only after a large number of boosting iterations 
and will eventually overfit if boosting iterates indefinitely. In practice, we 
found that convergence was more difficult for "small" data sets (i.e., larger 
number of iterations needed for large n and small d) than for data sets 
with large numbers of predictors, although this will be more closely related 
to the signal-to-noise ratio, in general. For fixed step-length u, our experi- 
ence suggests that F-fold CV is quite reliable for parameter tuning. Where 
asymptotic analysis suggests allowing v — > 0, we adopt Biihlmann's recom- 
mendation of setting v = 0.1. A simple sensitivity analysis revealed that co- 
efficient ensembles were rather insensitive to mild differences in step-length. 

3. Loglinear vis-a-vis hazards regression for lifetime data analysis. Rather 
than assert model (1.1), a popular alternative for lifetime data is to model 
the hazard function, A(i,Xj) = lim/ l ^ P r (^ < Ti < t + h\Ti > t,Xj). Cox's 
(1972) proportional hazards (PH) assumption asserts that 



where Ao(t) is an arbitrary function of time. Because one models the hazard 
function in (3.1), the coefficients are interpreted on a log relative risk scale. 
Regardless of coefficient interpretation, the maximum partial likelihood es- 
timator minimizes the following convex loss function: 



and coefficient ensembles are constructed accordingly. Note that when the 
errors in (1.1) are normally distributed, the proportional hazards model 
in (3.1) is misspecified. Similarly, it is easy to construct distributions where 
the PH model is correct and the loglinear model in (1.1) is incorrect. Both 
models are correct only when the distribution of the lifetime variable is 
an extreme value. Graphical displays and formal hypothesis tests for the 




A(t, X) = A (t) exp(/3iX a + ■ • ■ + p d X id ) 



(3.2) 
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(in)validity of the PH model have been a research topic for survival enthusi- 
asts for more than four decades. The PH model can fail in one of three ways 
[cf. Lin, Wei and Ying (1993)]: (a) the PH assumption, (b) the functional 
form of predictors, and (c) the link function. Any violation can have serious 
side effects on partial likelihood inference, including inefficient coefficient es- 
timates, hypothesis tests with the wrong size, and confidence intervals with 
the wrong coverage [Lin, Wei and Ying (1993)]. In addition to numerous re- 
search papers on detecting deviations from the PH model, some text books 
are dedicated to the topic [cf. Therneau and Grambsch (2000)] and routine 
tools are available in standard software. In our analysis of the microarray 
data in Section 4, we use a diagnostic tool developed by Grambsch and Th- 
erneau (1994) that investigates whether the covariate effect is constant over 
time. 

4. Analysis of microarray data. As described in Section 1, the goal of 
our microarray analysis is to summarize the association of p = 1036 gene 
expression levels with survival time. The combined data set from Harvard 
and Michigan consists of microarray data for n = 200 patients with 46.5% 
observations uncensored. We adopt boosting methods that assume linearity 
in the functional predictor /(X) = ^ ■ fijXj and, hence, coefficient estimates 
in the AFT model have the interpretation of increase in average (logarithm) 
survival time for a standard deviation (i.e., one unit) increase in gene Xj 
holding other factors fixed. 

First, we use heuristic methods to investigate the tenability of the Cox PH 
model assumption in the microarray data. All of our investigations are based 
on the cox.zph function [Grambsch and Therneau (1994)] in R which tests 
the specific PH model assumption that the log relative hazard is constant 
over time. Unless otherwise stated, we report the cox.zph global p- value 
for the full model. Similar to Morris et al., we begin by fitting Cox models 
one gene at a time. For each model, we record the p-value for the score 
test as well as the p- value for the diagnostic test; the results are displayed 
in Figure 1. The left panel in Figure 1 is a histogram of the p- values for 
all 1036 diagnostic tests. A total of 108 out of 1036 (10%) diagnostic tests 
rejected the null hypothesis at the nominal 0.05-level, that is, about 5% 
more than we would expect by chance alone. If the PH model assumption 
were true marginally for each gene in the microarray data, we would expect 
the p-values in the left panel of Figure 1 to be approximately uniformly 
distributed. The right panel of Figure 1 displays the p- values from the score 
test by diagnostic p- values from Cox regression fits to the univariate models. 
We find that 16 of 108 genes (15%) are declared to be significantly related 
to survival time at the nominal 0.05-level, but for which the diagnostic test 
rejects the PH model assumption of constant relative risk. Because the score 
test is not appropriate for 108 of the univariate models, it is unknown how 
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Score vs. Zph P-values 
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p-values from cox.zph() score 



Fig. 1. Results from diagnostic analyses on the microarray data. 

many of the 16 genes are not significant at the nominal level nor how many 
of the 92 genes which are not declared significant by the score test are, in 
fact, marginally associated with survival time. 

Because of multiple testing, it is known that drawing conclusions from 
univariate models is naive. We further investigated the tenability of the PH 
model assumption in multiple regression models using p-values from both 
the global test as well as individual composite hypothesis tests that the 
log relative hazard for a particular gene is constant given other genes in 
the model. The p-value for the global test did not drop below the nominal 
0.05-level until more than 80 genes entered the model. However, there was 
marginal evidence (p- value = 0.06) against the PH assumption using results 
from the composite tests with only the top three genes. After one fits 20 or 
more of the most significant genes into multiple regression Cox models, there 
is always strong evidence (p-value < 0.05) that at least one gene does not 
follow the PH model assumption given other significant genes in the model. 

We then investigated the sensitivity of the global diagnostic test to the 
addition of noise variables. Here, we took one of the 16 genes in Figure 1 
that was associated with survival time but did not satisfy the PH model 
assumption in a univariate regression fit; the first one in the data set was 
the 85th gene, so we chose that one. We then fit a Cox PH model with the 
covariate vector corresponding to this gene and an increasing number of noise 
vectors, that is, n-dimensional vectors of standard normal random variables. 
The p- value for the global test was 0.03, 0.09, and 0.16 with no, one, and two 
noise vector (s), respectively. The p- value for gene 85 from the Grambsch- 
Therneau composite score test tended to stay closer to the nominal level, 
although not always less than 0.05. Hence, while our application of the 
diagnostic tool is imperfect for high dimensional data analyses, there does 
seem to be some evidence contrary to the PH model assumption in joint 
analyses of the most significant genes identified through univariate regression 
fits. A prudent approach is to consider both the Cox PH and AFT models 
and we now describe results from those analyses. 
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Table 1 

Coefficients estimates for microarray data 



Gene 


PH 


Rel. prop. 


Gehan 


Rel. prop. 


Hs.34789 


-0.216 


100.0 


0.166 


100.0 


Hs. 146580 


0.113 


52.3 


-0.031 


18.7 


Hs.119000 


0.101 


46.8 


-0.079 


47.6 


riS.4Uo013 


0.053 


24. 


— U.022 


1 Q Q 

13.3 


Hs.407995 


0.052 


24.1 


-0.098 


59.0 


Hs.75106 


-0.041 


19.0 


0.007 


4.2 


Hs. 174185 


-0.025 


11.6 


0.035 


21.1 


Hs.2962 


0.024 


11.1 


-0.057 


34.3 


Hs.2934 


0.024 


11.1 


-0.107 


64.5 


Hs. 28491 







-0.039 


23.5 


Hs.82045 







0.007 


4.2 


Hs.576 







0.021 


12.7 


Hs. 14231 







0.070 


42.2 


Hs.57301 







-0.038 


22.9 


Hs. 13046 







-0.023 


13.9 


Hs. 36602 







0.014 


8.4 


Hs.301132 







0.058 


34.9 


Hs. 180107 







0.035 


21.1 


Hs.405945 







-0.036 


21.7 



In Table 1 we report the regression coefficients from the final model after 
5-fold CV for each of the PH and rank-based boostings. The third and fifth 
columns present the ratio of a given coefficient over the largest coefficient 
in the active set. We see that hazards regression selects a nine-gene panel, 
while boosting the sum of pairwise differences selects a 19-gene panel, with 
the latter panel a proper superset of the former panel. We see that the 
nine genes in the PH model are not the strongest nine genes in the 19-gene 
panel. Indeed, the sixth most significant gene in the PH model is the least 
significant among all 19 in the rank-based model. Moreover, the relative 
proportions indicate that the panel composition is totally different between 
the two methods. 

We intended to display all three methods — PH, Gehan, and IPW — side- 
by-side in Table 1, but the IPW active set included 94 genes and the table 
could not easily fit on one page. In Table 2 we display the set differences 
among final models by all three methods. IPW shares only three genes in 
common with the PH model and five genes in common with the rank-based 
model. More than 95% of the genes in the IPW model do not belong to 
either PH or rank-based models. Thus, for our microarray data analysis, the 
method proposed by Hothorn et al. (2006) leads to very different conclu- 
sions than the method proposed by Ridgeway (1999) and the one proposed 
here. 
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Table 2 

Set differences (A-B) among PH, Gehan, and I PW methods 



Set A 


Total 




Set B 




PH 


Gehan 


IPW 


PH 


9 







6 


Gehan 


19 


10 




14 


IPW 


94 


91 


89 





5. Comparative data analyses in large samples. In this section we pro- 
vide two more real data examples where rank-based ensembles lead to dif- 
ferent conclusions than PH or IPW ensembles. Unlike the microarray data 
set, the following data sets have many more observations than predictors. 
A consequence of the large sample size is that we may apply cox.zph di- 
rectly to the entire data set, as seen in the first comparative data analysis. 
In the second analysis, we demonstrate that differences between rank-based 
and IPW ensembles transcend base learner and affects the predictive scores 
significantly. 

5.1. Analysis of nursing home data. From 1980-1982, the National Cen- 
ter for Health Services Research conducted a study to determine the effect of 
financial incentives on variation of patient care in nursing homes. In partic- 
ular, 18 out of 36 nursing homes from San Diego, California, received higher 
per diem payments for accepting and admitting Medicaid patients and addi- 
tional bonuses when the patient's prognosis improved. The study collected 
data from an additional 18 control nursing homes where no financial incen- 
tives were used. A complete description is given in Morris, Norton and Zhou 
(1994). The total sample size from all 36 nursing homes is n = 1601. 

Our data set consists of seven main effects and six 2- way interactions. The 
main effects are treatment (trt), age, sex, marital status, and three health 
status indicators, ranging from the best health to the worst health. The 2- 
way interactions are possible interactions among treatment, age, sex, and 
marital status. This data set was previously analyzed by Fan and Li (2002) 
using the PH model. Using the Grambsch and Therneau (1994) diagnostic 
test (cox.zph in R) for proportional hazards suggests that the PH model 
is inadequate (global p- value = 0.003). This does not prove the AFT model 
is correct but encourages us to look beyond the PH model in performing 
variable selection. 

Table 3 presents the results from our analysis of the nursing home data. 
We performed boosting in the PH and AFT models even though our pre- 
liminary analysis indicated the inadequacy the former model. Within the 
AFT model, we performed coefficient ensembles using both the Gehan and 
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Table 3 

Boosted coefficients estimates from the nursing 
home data 





Cox 


Gehan 


IPW 


trt 


-0.018 


0.060 


0.309 


age 


-0.086 


0.152 


0.109 


sex 


0.165 


-0.283 


-0.129 


married 


0.061 


-0.066 




hi 








h2 


0.098 


-0.215 


-0.160 


h3 


0.157 


-0.291 


-0.212 


trt* age 


0.017 






trt*sex 


-0.015 






trt*married 




-0.027 


-0.124 


age*sex 


0.068 


-0.143 


-0.112 


age*married 


0.021 


-0.005 


-0.048 


sex*married 









IPW estimators and noted that the two models agree with the exception 
that IPW does not include the effect of marital status in the final model. 
A major difference between IPW and Gehan estimates is the magnitude of 
the treatment effect: three times the effect of age by the former method 
and less than half the age effect in the latter method. In comparing the 
Cox and Gehan coefficient ensembles, we note that the inclusion and exclu- 
sion of all seven main effects agree. However, the presence or absence of all 
three interactions involving treatment is reversed in the final models. Both 
IPW and Gehan estimates exclude treatment-by-age and treatment-by-sex 
interactions but include the treatment-by-marital status interaction. The 
relative magnitude of all two-way interactions is modest to moderate with 
the age-by-sex interaction being strongest. 

5.2. Blackbox methods on the Wisconsin PBC data. Regression trees are 
the most common base procedure in the machine learning community [Fre- 
und and Schapire (1996, 1997); Bxihlmann and Hothorn (2007)] and non- 
parametric procedures are gaining popularity in applications with complex 
data. In this section we compare rank-based boosting to existing proce- 
dures for survival outcomes using the Wisconsin Prognostic Breast Cancer 
(WPBC) data set. The WPBC data set was contributed by Street, Man- 
gasarian and Wolberg (1995) for developing diagnostic models of breast can- 
cer recurrence and is available from the UCI repository for machine learning 
data bases. The survival outcome is time to breast cancer recurrence and 
30 predictors describe features of cell nuclei taken from a digitized image of 
fine needle aspirate of breast mass. This data set was analyzed previously by 
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Parameter Tuning Gehan Vs. IPW 




mstop Gehan 

Fig. 2. Blackbox methods applied to the Wisconsin PBC data. Parameter tuning for 
Gehan loss is displayed in the left panel, while predictive scores between Gehan and IPW 
losses are illustrated in the right panel. Blue circles denote predicted scores from uncensored 
observations, while the red crosses denote censored observations. 

Biihlmann and Hothorn (2007) using inverse-probability weighting (IPW) 
methods. 

Using regression trees as base learners, we boosted the following loss func- 
tions: IPW with L2 loss, IPW with L\ loss, and the sum of pairwise differ- 
ences of absolute residuals (Gehan) in (2.6). Because IPW boosting with L\ 
loss led to problems in identifiability, only the results from L2 loss are pre- 
sented below. Figure 2 summarizes output from the data analyses. The left 
panel illustrates parameter tuning for boosting the Gehan loss with regres- 
sion trees as base learners and shows that the optimal m st0 p value was 197 
iterations when v = 0.1. The right panel compares the optimal fitted val- 
ues /(Xj) from blackbox Gehan and IPW methods using regression trees 
as base learners. The blue circles denote predicted scores from uncensored 
observations, while the red crosses denote censored observations. We note 
considerable disagreement between the risk scores even though the same 
data, same AFT model, and same boosting method are used in the statisti- 
cal learner — the only difference is in the loss function. Pearson's correlation 
coefficient was only r = 0.45. A total of 66 (33%) scores had different signs 
depending on the method; 52 (26%) scores had absolute difference greater 
than one unit and ten (5%) scores were greater than two units apart. Of the 
ten observations whose risk scores were more than two units apart, nine ob- 
servations were censored. It is evident that boosting different loss functions, 
particularly ones that embody different assumptions about the underlying 
data, can result in different estimated parameters and with potentially dif- 
ferent conclusions. We repeated the analysis using smoothing splines as base 
learners and found similar conclusions to the ones reported in Figure 2; thus, 
the results from smoothing splines are not shown. 
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6. Simulations. 

6.1. Comparisons to partial likelihood. We performed numerous simula- 
tion studies to compare coefficient ensembles obtained from boosting the 
partial likelihood to ensembles obtained from boosting the Gehan loss func- 
tion. We simulated data according to the AFT model, 



where d = 8, the coefficient vector (3 = (3, 3/2, 0, 0, 2, 0, 0, 0) x k, and the er- 
ror distribution was one of standard normal, extreme value (i.e., log Weibull 
with unit shape), or a mixture distribution. The mixture distribution was 
standard normal contaminated by a Student's t on three degrees of freedom 
and contamination is controlled by a Bernoulli indicator with success prob- 
ability 0.2. The predictors are distributed multivariate normal with mean 
zero and covariance cov(Xj,Xk) = (l/2)l J ~ fc L The constant k controls the 
magnitude of the coefficient vector and hence the signal-to-noise ratio; here, 
we considered k equal to 1/4, 1/2, 3/4, and 1. A total of 100 Monte Carlo 
data sets were computed for each sample size of n = 60, 80, and 100. 

We evaluated estimators based on ubiquitous performance measures from 
the variable selection literature: model error (ME), mean squared error 
(MSE), the average number of correct zeros (C), and the average number of 
incorrect zeros (I). Regression coefficient estimates from boosting the partial 
likelihood are multiplied by minus one so that both estimators are estimat- 
ing the true coefficient vector /3 under extreme value error distributions. 
In the linear model, prediction error is written as the sum of model error 
plus noise, where model error is defined ME = ((3 — /3 ) T i?(XX T )(/3 — /3 ). 
Although this performance measure is imperfect outside the linear model, it 
complements the other measures in a manner familiar to many statisticians. 
The median ME (MME) is presented in Figure 3. The MSE is defined as 
ME with the d-dimensional identity matrix replacing the covariance matrix, 
£(XX T ). The definition of correct and incorrect zero is straightforward; 
a larger number is better for the former measure, while a smaller number is 
better in the latter. 

We performed coefficient selection and estimation in a variety of sim- 
ulation scenarios with and without censoring, noting that censored data 
methods apply to complete data as well. In Figure 3 we present the results 
for the uncensored data case so that we might compare our results to a rou- 
tine implementation of L2 boosting. So, the three curves in Figure 3 refer 
to L2 (L) boosting, rank-based boosting (R), and boosting in the Cox (C) 
PH model. 



(6.1) 




j'=i 
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Fig. 3. Simulation results in uncensored data. Coefficient ensembles displayed for the Cox 
(C) PH model, rank-based (R) and L2 (L) boostings in a linear model (see text). Abscissa 
is expressed in k ("kappa"), a multiplicative constant on the coefficients in simulation 
studies, while ordinates refer to median model error (mme), average number of correct 
(C) and incorrect (I) zeros, with the last measure on a square-root scale. 



Our simulation results indicate that as the magnitude of the true coef- 
ficient vector increases, the ME and MSE for Li and rank-based boosting 
remain about the same while that for the PH model increases dramatically. 
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This result was true regardless of the error distribution, even when the PH 
model assumptions were satisfied. However, the gap in MSE between rank- 
based boosting and boosting the Cox model decreases as the sample size n 
increases. Interestingly, boosting in the Cox model has a tendency to favor 
sparse models when effect sizes are small and identifies models of similar 
complexity when effects sizes are moderate to large. Again, there was sig- 
nificant agreement in trends of model complexity across error distributions. 
If one considers all four performance measures together, then rank-based 
boosting is preferred. If one dismisses ME and MSE as unfair performance 
measures across PH and AFT models, the model complexity via rank-based 
boosting is less sensitive to the magnitude of the effect size. In the presence 
of many small effects, the Cox model will identify a larger number of correct 
zeros but simultaneously overlook a larger proportion of true effects. 

Next, we simulate censored data and no longer consider L2 boosting. 
We again simulate uncensored data according to the linear model in (6.1) 
with autoregressive design and normal errors. True regression coefficients 
are generated in two clusters according to the following rule: 

• for h = 1, . . . , 4, set initial coefficients (3<i + k h = Pi3+k h = {h — k) 2 , for \k\ < 
h, 

• multiply initial coefficients by a constant to yield theoretical R 2 = 3/4, 
where theoretical R 2 for random design is 

r2 _ /3^(XX T )/3 

with a the standard deviation of £j. Under this simulation scenario, the 
signal strength remains the same, while the proportion of active variables 
changes with h = 1, . . . ,4. Censored random variables were uniformly dis- 
tributed Un(0,r) to yield about 25% censoring. As in our earlier Monte 
Carlo studies, we monitor the mean-squared error (MSE), average number 
of correct (C) and incorrect (I) zeros. We also monitor the average false 
selection rate (FSR), computed as the proportion of unimportant variables 
relative to the cardinality of the active set. A summary over 100 Monte 
Carlo data sets for each of Model H1-H4 is displayed in Figure 4. 

As in our earlier simulations, estimating regression coefficients under 
an incorrect proportional hazards assumption can lead to substantial bias. 
Hence, the MSE is much higher using Cox PH compared to rank-based esti- 
mation, but the bias decreases as the proportion of active variables increases. 
Remarkably, ensembles via Cox PH were mildly better than rank-based en- 
sembles in identifying correct zeros, but the better performance came at 
a price of incorrectly setting important variables to zero. Hence, PH ensem- 
bles tend to select models that are too sparse under a normal-theory linear 
model and many moderate effects are ignored completely. 
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Fig. 4. Simulation results from comparing Cox (C) PH ensembles to rank-based (R) 
ensembles in uncensored data. Abscissa is labeled in terms of Models H1-H4, with an 
increasing number of nonzero coefficients for fixed theoretical R 2 . Ordmates refer to mean 
squared error (mse), average number of correct (C) and incorrect (I) zeros, and the false 
selection rate (fsr). 



6.2. Comparisons to inverse-probability weighting. We performed sepa- 
rate simulation studies to compare rank-based boosting and inverse-probabi- 
lity weighted (IPW) boosting [Hothorn et al. (2006)] in the AFT model. We 
simulated data according to a similar AFT model used in earlier simulation 
studies, 

d 

logT = J2^X j + ae, 
i=i 

where the distribution of e follows one of standard normal, log Weibull 
with unit shape, or Student's t on three degrees of freedom. Two modeling 
differences are that we fixed the constant n = 1 and, hence, the coefficient 
vector is /3 = (3,3/2,0,0,2,0,0,0) and varied the signal-to-noise ratio by 
increasing the scale parameter a from one-half to two in increments of 0.5. 
We again used an autoregressive design for the matrix of predictors. The 
performance measures are identical to those discussed above. 

The IPW boosting method by Hothorn et al. (2006) depends on cor- 
rectly modeling the (conditional) censoring mechanism. But the procedure 
implemented in their mboost R package makes the strong assumption that 
censoring is independent of failure times. Here, we simulate such data by 
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generating uniform censoring random variables, that is, C ~ Un(0,5). Sim- 
ulation results under the independent censoring assumption are provided in 
Figure 5. A weaker assumption is to suppose that censoring is condition- 
ally independent of failure time given covariates and we simulate such data 
through the model, C = /3 T X + Un(0,2). Figure 6 summarizes simulation 
results under the latter modeling assumptions. We report results for a sam- 
ple of size n = 60 independent observations. In summary, the simulation 
scenarios are as follows: 

• the error e is distributed as one of standard normal, extreme value, or £3; 
censoring is independent of failure time and C ~ Un(0, 5); 

• the error e is distributed as Student's t on r degrees of freedom, r = 
1,3,5,10,15,20; censoring is conditionally independent of failure time 
given covariates and is modeled C = /3 T X + Un(0, 2). 

When the stronger independent censoring assumptions are satisfied, Fig- 
ure 5 suggests that IPW boosting is a better procedure than rank-based 
boosting when the error distribution has light tails but not when the error 
distribution has heavy tails. Of course, the IPW methodology is general and 
a more robust version of the IPW may fix the deficiencies seen in the last 
column of the results in Figure 5 [Hothorn et al. (2006)]. However, Figure 6 
suggests that excessive bias in IPW coefficient ensembles is not easily reme- 
died by merely swapping loss functions. In Figure 6 we see that the ME and 
MSE plateau around 1.9, while the rank-based Gehan procedure plateaus at 
a value less than one-half. Under the second simulation scenario, our results 
suggest that IPW procedures produce estimates with excessive bias and pre- 
fer models that are too simple. If a squared error loss function is used to 
model lifetime data, the IPW estimates will be sensitive to outlying lifetime 
values. 

7. Remarks. High-dimensional survival analysis is an important appli- 
cation of the boosting machinery. In addition to placing weak restrictions 
on the predictive function /(•), boosting applies to any well-defined convex 
loss function. For survival outcomes, one's first inclination may be to adopt 
Cox's (1972) proportional hazards (PH) model and estimate /(•) by mini- 
mizing the negative log partial likelihood. It is known that the PH model 
assumptions can fail when n> p and the problem persists for large-scale 
regression problems too. Diagnosing the veracity of the PH model assump- 
tions in high dimensions is an intriguing research problem and beyond the 
scope of the current manuscript. When the proportional hazards assump- 
tion is inadequate or false, the accelerated failure time (AFT) model may 
be a reasonable alternative. In Section 6 we illustrated the effects on vari- 
able selection when the PH model is misspecified but the AFT model is 
correctly specified. Head-to- head comparisons of coefficient ensembles via 
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Fig. 5. Simulation results from comparing inverse probability weighting (W) versus 
rank-based (R) boosting via Gehan loss when model assumptions are satisfied for both 
procedures. Abscissa label is "sigma, " the scale parameter on the errors in simulation 
studies; ordinate labels are median model error (mme), average number of correct (C) and 
incorrect (I) zeros. 



hazards versus linear modeling suggest that the former identifies a slightly 
higher proportion of correct zeros and has a lower proportion of unimportant 
variables in final models even when the PH model assumption is incorrect. 
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Fig. 6. Simulation results from comparing inverse probability weighting (W) versus 
rank-based (R) boosting via Gehan loss when IPW model assumptions are violated. Ab- 
scissa label is degrees of freedom (df) of a Student's t distribution from which the outcomes 
were generated; ordinate labels are median model error (mme), mean squared error, aver- 
age number of correct (C) and incorrect (I) zeros. 



At the same time, in addition to expected excessive bias in misspecified 
models, we find that PH ensembles tend to select models that are too sparse 
and, hence, ignore many moderate effects. 

Hothorn et al. (2006) discussed boosting survival data in the semi-parame- 
tric AFT model via inverse probability weighting (IPW). The coefficient en- 
sembles by Hothorn et al. (2006) are built on a theory of inverse probability 
weighting (IPW), a powerful technique for general missing data problems 
[van der Laan and Robins (2003); Tsiatis (2006)]. The idea is to model the 
censoring mechanism as a function of covariates and then weight the uncen- 
sored data by the reciprocal of the estimated "complete case" probability. If 
the censoring mechanism does not depend on covariates, then the modeling is 
accomplished nonparametrically via Kaplan-Meier estimation. In contrast, 
the ensemble methods proposed here are based on minimizing rank-based 
dispersion criteria and do not require modeling the censoring mechanism. 
Both IPW and rank-based ensemble methods minimize convex loss functions 
and fit neatly into the boosting template; however, rank-based methods op- 
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erate under less stringent conditions. Although second-stage modeling can 
improve the simple inverse-weighting method proposed by Hothorn et al. 
(2006), the secondary models would be difficult to verify and computational 
details left to the user. Our simulation studies indicated that the boost- 
ing method by Hothorn et al. (2006) is as good or better than rank-based 
boosting in special cases but not in general. 

In conclusion, our analyses and simulation studies indicate that each of 
PH, rank-based, and IPW survival ensembles can exhibit the best and worst 
learning behavior depending on the model, the data, and how one evaluates 
model performance. We were surprised that boosting in Cox's PH model 
performed as well as it did even in a misspecified linear model with normal 
errors. But even in this simple simulation scenario, we were able to replicate 
the types of difference betweeen PH and rank-based ensembles and potential 
underfitting behavior similar to what we saw in our microarray data set. We 
feel that rank-based survival ensembles have merit and will provide scientists 
with a strong tool for investigating large data sets in a wide variety of 
settings. 
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